A Default First Order Family Weight Determination Procedure for WPDV Models
Abstract
Weighted Probability Distribution Voting (WPDV) is a newly designed machine learning algorithm, for which research is currently aimed at the determination of good weighting schemes. This paper describes a simple yet effective weight determination procedure, which leads to models that can produce competitive results for a number of NLP classification tasks.

1 The WPDV algorithm

Weighted Probability Distribution Voting (WPDV) is a supervised learning approach to classification. A case which is to be classified is represented as a feature-value pair set:

$$F_{case} = \{\{f_1 : v_1\}, \ldots, \{f_n : v_n\}\}$$

An estimation of the probabilities of the various classes for the case in question is then based on the classes observed with similar feature-value pair sets in the training data. To be exact, the probability of class $C$ for $F_{case}$ is estimated as a weighted sum over all possible subsets $F_{sub}$ of $F_{case}$:

$$\tilde{P}(C) = N(C) \sum_{F_{sub} \subset F_{case}} W_{F_{sub}} \, \frac{freq(C \mid F_{sub})}{freq(F_{sub})}$$

with the frequencies (freq) measured on the training data, and $N(C)$ a normalizing factor such that $\sum_{C} \tilde{P}(C) = 1$.

In principle, the weight factors $W_{F_{sub}}$ can be assigned per individual subset. For the time being, however, they are assigned for groups of subsets. First of all, it is possible to restrict the subsets that are taken into account in the model, using the size of the subset (e.g. $F_{sub}$ contains at most 4 elements) and/or its frequency (e.g. $F_{sub}$ occurs at least twice in the training material). Subsets which do not fulfil the chosen criteria are not used. For the subsets that are used, weight factors are not assigned per individual subset either, but rather per "family", where a family consists of those subsets which contain the same combination of feature types (i.e. the same $f_i$).

The two components of a WPDV model, distributions and weights, are determined separately.
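The weighted sum over subsets can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the author's implementation: the function names (`subsets`, `train`, `classify`), the dictionary representation of a case, the exclusion of the empty subset, and the use of a single normalizing constant over all classes are all choices made here for the example.

```python
from collections import Counter
from itertools import combinations


def subsets(case, max_size):
    """Yield every non-empty subset of the case's feature-value pairs.

    Assumption: the empty subset is skipped and subsets are capped at
    max_size elements (the size restriction mentioned in the text).
    """
    pairs = sorted(case.items())
    for k in range(1, min(max_size, len(pairs)) + 1):
        for combo in combinations(pairs, k):
            yield frozenset(combo)


def train(cases, max_size=4):
    """Count freq(F_sub) and the class counts per F_sub on the training data."""
    sub_freq, class_sub_freq = Counter(), Counter()
    for features, label in cases:
        for sub in subsets(features, max_size):
            sub_freq[sub] += 1
            class_sub_freq[(label, sub)] += 1
    return sub_freq, class_sub_freq


def classify(case, classes, sub_freq, class_sub_freq, weight,
             max_size=4, min_freq=1):
    """Estimate class probabilities as a weighted sum over subsets.

    `weight` is a hypothetical callable mapping a subset to its weight.
    Subsets seen fewer than min_freq times (or never) are not used.
    """
    scores = dict.fromkeys(classes, 0.0)
    for sub in subsets(case, max_size):
        n = sub_freq.get(sub, 0)
        if n < max(1, min_freq):
            continue
        w = weight(sub)
        for c in classes:
            scores[c] += w * class_sub_freq.get((c, sub), 0) / n
    total = sum(scores.values())
    if total == 0:
        # No usable subset: fall back to a uniform distribution (assumption).
        return {c: 1.0 / len(scores) for c in scores}
    return {c: s / total for c, s in scores.items()}
```

Passing `weight=lambda sub: 1.0` gives the uniform weighting; any family-based scheme can be plugged in through the same parameter.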
In this paper, I will use the term training set for the data on which the distributions are based and tuning set for the data on the basis of which the weights are selected. Whether these two sets should be disjunct or can coincide is one of the subjects under investigation.

2 Family weights

The various family weighting schemes can be classified according to the type of use they make of the tuning data. Here, I use a very rough classification, into weighting scheme orders.

With 0th order weights, no information whatsoever is used about the data in tuning set. Examples of such rudimentary weighting schemes are the use of a weight of $k!$ for all subsets containing $k$ elements, as has been used e.g. for wordclass tagger combination (van Halteren et al., To appear), or even a uniform weight for all subsets.

With 1st order weights, information is used about the individual feature types, i.e.

$$W_{F_{sub}} = \prod_{\{i \,\mid\, \{f_i : v_i\} \in F_{sub}\}} W_{f_i}$$

First order weights ignore any possible interaction between two or more feature types, but ...
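The 0th and 1st order schemes described above can be sketched in a few lines of Python. This is a hedged illustration only: it assumes a subset is represented as a frozenset of (feature type, value) tuples, and the names `family`, `zeroth_order_weight`, `first_order_weight` and `type_weights` are hypothetical, not taken from the paper.

```python
from math import factorial, prod


def family(sub):
    """Return the family of a subset: the combination of feature types in it."""
    return frozenset(feature for feature, _value in sub)


def zeroth_order_weight(sub):
    """0th order scheme: weight k! for a subset with k elements."""
    return float(factorial(len(sub)))


def first_order_weight(sub, type_weights):
    """1st order scheme: product of the per-feature-type weights.

    `type_weights` maps each feature type to its weight; how these values
    are chosen from the tuning set is not shown here.
    """
    return prod(type_weights[feature] for feature in family(sub))
```

Because the weight depends only on the feature types, two subsets with the same types but different values fall in the same family and receive the same weight.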
Related resources
A Default First Order Family Weight Determination
Weighted Probability Distribution Voting (WPDV) is a newly designed machine learning algorithm, for which research is currently aimed at the determination of good weighting schemes. This paper describes a simple yet effective weight determination procedure, which leads to models that can produce competitive results for a number of NLP classification tasks. 1 The WPDV algorithm Weighted Probabilit...
Location Reparameterization and Default Priors for Statistical Analysis
This paper develops default priors for Bayesian analysis that reproduce familiar frequentist and Bayesian analyses for models that are exponential or location. For the vector parameter case there is an information adjustment that avoids the Bayesian marginalization paradoxes and properly targets the prior on the parameter of interest thus adjusting for any complicating nonlinearity the details ...
Weighted probability distribution voting, an introduction
This paper introduces a new machine learning technique, Weighted Probability Distribution Voting (WPDV). During learning, WPDV determines the output class probability distribution for each input feature, both atomic and complex. During classification, WPDV takes all input features that occur in the new input and adds the corresponding probability distributions, each multiplied by a weight facto...
Investigating the missing data effect on credit scoring rule based models: The case of an Iranian bank
Credit risk management is a process in which banks estimate the probability of default (PD) for each loan applicant. Data sets of previous loan applicants are built by gathering their data; these internal data sets are usually completed using external credit bureau data and finally used for estimating PD in banks. There is also a continuous interest among banks in using rule based classifiers to b...
Implication of an Integrated Approach to the Determination of Water Saturation in a Carbonate Gas Reservoir Located in the Persian Gulf
Water saturation determination is one of the most important tasks in reservoir studies, because predicting oil and gas in place requires it to be calculated accurately. This important reservoir parameter is commonly estimated from various well log data and by applying correlations that may not be accurate in some practical cases, especially for carbonate reservoirs. Sin...
Publication date: 2000